Abstract

The primary role of a protein coding gene is to encode amino acids. Therefore, synonymous sites of
codons, which do not change the encoded amino acid, are regarded as evolving neutrally. However, if a
certain region of a protein coding gene contains a functional nucleotide element (e.g. a splicing signal),
synonymous sites in the region may be under selective pressure. Such elements could be detected by
searching for regions of low nucleotide substitution. We explored invariant nucleotide sequences
in 10 790 orthologous genes of six mammalian species (Homo sapiens, Macaca mulatta, Mus musculus,
Rattus norvegicus, Bos taurus, and Canis familiaris), extracted 4150 sequences whose conservation
is significantly stronger than that of other regions of the gene, and named them significantly conserved
coding sequences (SCCSs). SCCSs are observed in 2273 genes. These genes are mainly involved in development,
transcriptional regulation, and neurons, and are expressed in the nervous system and the head and
neck organs. No strong influence of conventional factors that affect synonymous substitution was
observed in SCCSs. These results imply that SCCSs may have a double function as nucleotide elements
and protein coding sequences and have been retained in the course of mammalian evolution.
Key words: mammal; protein coding; nucleotide conservation



1. Introduction

The neutral theory of molecular evolution 1,2 predicts that synonymous sites of codons
evolve faster than non-synonymous sites because of the smaller selective pressure.
This is true in general; however, several factors are known to act on certain regions
of a coding sequence and suppress synonymous substitution.

One of the well-known factors is the codon bias towards optimum codons. Optimum codons
reflect the composition of the genomic tRNA pool. 3-5 Because optimum codons are
advantageous for fast and accurate translation, highly expressed or biologically
important genes would have more optimum codons than others. 6-8 Changes from an optimum
codon to a non-optimum codon will be suppressed in such genes. Because optimum codons
are similar in closely related species, highly expressed genes tend to show similar
codon usage; therefore synonymous substitution is lowered. In fact, the requirement for
translational efficiency or accuracy enhances the optimum codon usage and suppresses
nucleotide changes through purifying selection. 5,6,9-11 Codon bias towards optimum
codons is strong in fast-growing organisms such as Escherichia coli or Saccharomyces
cerevisiae, but generally weak in species with a slow growth rate and a small population
size. 12,13

Another factor is exonic splicing enhancers or silencers, which are splicing signals
embedded in exons. 14,15 The existence of such elements lowers synonymous
substitution. 16,17 In addition, ultraconserved elements (UCEs), which mostly reside in
non-protein coding regions, sometimes extend into coding regions. 18 In mammals, UCEs
are reported to lie near or overlap with genes associated with nucleotide binding,
transcriptional regulation, the RNA recognition motif, the zinc finger domain, and the
homeobox domain. 18-20 Hox genes also contain long conserved nucleotide regions other
than UCEs outside the homeobox domain. 21

Although the primary role of a protein coding region is to encode amino acids, there
may also be functional nucleotide elements embedded within coding regions. For example,
transcription-factor-binding sites are found in coding regions, 22 messenger RNAs are
targeted by various post-transcriptional regulations, 23 and the requirement for a
specific secondary structure for RNA editing decreases synonymous substitution. 24,25

Functional nucleotide elements have been extensively explored in non-coding
regions, 26-30 but fewer studies have explored probable functional elements within
coding regions. We extracted significantly conserved coding sequences (SCCSs) from
orthologous genes of six mammalian species (human, rhesus macaque, mouse, rat, cow, and
dog), and compared genes containing SCCSs with genes without SCCSs. Analyses of gene
ontology (GO), InterPro codes, and KEGG pathways highlight differences between the two
gene groups. We also investigated RNA secondary structures, codon preference, GC
content, exonic splicing signals, and gene expression of SCCSs to survey the influence
of these factors.



2. Materials and methods 

2.1. Genome data

We obtained peptide and nucleotide sequences of orthologous genes of six mammalian
species (Homo sapiens, Macaca mulatta, Mus musculus, Rattus norvegicus, Bos taurus, and
Canis familiaris) from the Ensembl database 31 version 54
(http://May2009.archive.ensembl.org/index.html). These species were selected
considering genome data quality and evolutionary diversity. We selected one-to-one
single-copy orthologs and compiled 10 790 orthologous gene sets (Supplementary file
S1). We then constructed multiple alignments of the peptide sequences using ClustalW 32
and constructed nucleotide alignments based on the peptide alignments. From the
nucleotide alignments, we extracted sequences that are invariant among all the species.
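The extraction of invariant sequences from a codon-based nucleotide alignment can be sketched as follows. This is a minimal illustration and not the authors' code: the function names are hypothetical, and it assumes that the aligned sequences are in frame, that gaps are written as "-", and that a codon column counts as invariant only when it is gap-free and identical in every species.

```python
from typing import List, Tuple


def split_codons(seq: str) -> List[str]:
    """Split an aligned, in-frame nucleotide sequence into codon-sized chunks."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def invariant_flags(aligned_seqs: List[str]) -> List[bool]:
    """Flag codon columns that are gap-free and identical in every sequence."""
    columns = zip(*(split_codons(s.upper()) for s in aligned_seqs))
    return [len(set(col)) == 1 and "-" not in col[0] for col in columns]


def invariant_runs(flags: List[bool]) -> List[Tuple[int, int]]:
    """Return (start_codon_index, length_in_codons) for each invariant run."""
    runs: List[Tuple[int, int]] = []
    start = None
    # A trailing False acts as a sentinel that closes a run at the end.
    for i, flag in enumerate(list(flags) + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            runs.append((start, i - start))
            start = None
    return runs
```

Runs of at least a chosen length (for example, the 10-codon threshold used in Section 2.2) could then be kept by filtering on the second element of each tuple.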

2.2. Identification of SCCSs 

We performed a permutation simulation to identify significantly conserved coding
sequences (SCCSs), which are invariant sequences longer than 10 codons. This length
threshold was set to keep the permutation run time within a feasible range. For an
N-codon-long alignment, we generated a non-redundant series of random numbers from 1 to
N and permuted the codon columns (the codons at the same site of the alignment)
according to the generated random numbers. In this process, gap sites are fixed and the
rest of the sites are permuted. Then the lengths and numbers of invariant sequences in
the permuted alignment are counted. We repeated this process 500 000 times per ortholog
set and averaged the frequency of invariant sequences. We used the length and averaged
frequency of invariant sequences obtained from the permutation results as the random
expectation, and evaluated the probability of the invariant sequences in the original
alignment based on this expectation. This approach helps identify sequences whose
conservation is rare given the substitution background of each alignment. Multiple
testing correction of P-values was done by FDR (false discovery rate). 33 We then
identified invariant sequences longer than 10 codons with P < 0.01 as SCCSs.
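The permutation step described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: each codon column is reduced to a flag (True for invariant, False for variant, None for a gap column), which is equivalent to permuting the columns themselves because a column's invariance does not depend on its position; a gap column is assumed to end an invariant run; and the function names and the default number of permutations are hypothetical. The sketch stops at the expected frequency of each run length, because the text does not specify how that expectation is converted into a probability.

```python
import random
from collections import Counter
from typing import Dict, List, Optional, Sequence


def run_lengths(flags: Sequence[Optional[bool]]) -> List[int]:
    """Lengths of consecutive True runs; False and None (gap) end a run."""
    lengths: List[int] = []
    current = 0
    for flag in flags:
        if flag:
            current += 1
        else:
            if current:
                lengths.append(current)
            current = 0
    if current:
        lengths.append(current)
    return lengths


def permute_keep_gaps(flags: Sequence[Optional[bool]],
                      rng: random.Random) -> List[Optional[bool]]:
    """Shuffle the non-gap columns among themselves; gap columns stay in place."""
    movable = [f for f in flags if f is not None]
    rng.shuffle(movable)
    shuffled = iter(movable)
    return [None if f is None else next(shuffled) for f in flags]


def expected_run_counts(flags: Sequence[Optional[bool]],
                        n_perm: int = 1000,
                        seed: int = 0) -> Dict[int, float]:
    """Average number of invariant runs of each length over n_perm permutations."""
    rng = random.Random(seed)
    totals: Counter = Counter()
    for _ in range(n_perm):
        totals.update(run_lengths(permute_keep_gaps(flags, rng)))
    return {length: count / n_perm for length, count in totals.items()}
```

Comparing the run lengths observed in the original alignment with `expected_run_counts` gives the alignment-specific background against which unusually long invariant runs stand out.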

2.3. Analysis on GO, InterPro, and KEGG pathways

Protein coding genes that contain at least one SCCS are named SCCS genes and those that
do not contain an SCCS are named non-SCCS genes. We used the Fatigo web service
(http://babelomics.bioinfo.cipf.es/functional.html) to identify GO terms, InterPro
codes, and KEGG pathways that are significantly enriched in the SCCS gene group or in
the non-SCCS gene group. Fatigo accepts a list of Ensembl gene IDs as input and
provides P-values for enrichment of the above terms. P-values are calculated by
Fisher's exact test and corrected by FDR. We used Ensembl gene IDs of H. sapiens as
input and performed a two-tailed comparison between SCCS genes and non-SCCS genes.
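The FDR correction mentioned above can be illustrated with the Benjamini-Hochberg step-up procedure. That this particular procedure is the one used by the authors or by Fatigo is an assumption, since the text only says "FDR"; the function name is hypothetical.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the original input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

A term would then be reported as significant when its adjusted value falls below the chosen threshold (0.01 in this study).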

2.4. Analysis on preferred codons and average codon degeneracy of SCCSs

We defined preferred codons as the most frequently used codon for a given amino acid,
referring to the Codon Usage Database (http://www.kazusa.or.jp/codon/index.html)
provided by the Kazusa DNA Research Institute. Because the codon usage pattern was
similar among the six species we used, the codon table of human was used as the
representative. We counted the number of preferred codons in an SCCS and divided it by
the codon length of the SCCS, and then used the quotient as the ratio of preferred
codons. The average codon degeneracy is calculated by summing up the degeneracy of each
codon and dividing the sum by the codon length of the SCCS.
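The two per-sequence measures defined above can be sketched as follows. Two points are assumptions: the degeneracy of a codon is taken to be the number of codons in the standard genetic code that encode the same amino acid (the text does not give its definition), and the preferred-codon table is passed in by the caller rather than reproduced here, so the mapping used in any example is a placeholder and not the Kazusa human table. Function names are hypothetical.

```python
from collections import Counter
from itertools import product
from typing import Dict, List

BASES = "TCAG"
# Standard genetic code in TCAG order ("*" marks stop codons).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE: Dict[str, str] = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
# Assumed definition: number of synonymous codons per amino acid.
DEGENERACY = Counter(CODON_TABLE.values())


def codons_of(seq: str) -> List[str]:
    """Split an in-frame, gap-free coding sequence into codons."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def preferred_codon_ratio(seq: str, preferred: Dict[str, str]) -> float:
    """Fraction of codons equal to the preferred codon of their amino acid.

    `preferred` maps a one-letter amino acid to its preferred codon.
    """
    codons = codons_of(seq)
    hits = sum(1 for c in codons if preferred.get(CODON_TABLE[c]) == c)
    return hits / len(codons)


def average_degeneracy(seq: str) -> float:
    """Mean number of synonymous codons over the codons of the sequence."""
    codons = codons_of(seq)
    return sum(DEGENERACY[CODON_TABLE[c]] for c in codons) / len(codons)
```

For example, the sequence ATG CTG AAA has codon degeneracies 1, 6, and 2 under this definition, giving an average of 3.0.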

2.5. Prediction of RNA secondary structures

We computationally predicted the secondary structures and folding free energy of SCCSs
using the Vienna RNA software package 34 (http://www.tbi.univie.ac.at/~ivo/RNA/).
Because folding free energy varies depending on the sequence length, we constructed a
free energy distribution from 1000 randomly chosen sequences for each length (33-246
nucleotides). The P-value for a given free energy was evaluated based on these
distributions. Multiple testing correction was done by FDR.
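One way to evaluate a free energy against a length-matched background distribution is an empirical one-sided P-value, sketched below. The exact procedure used in the study is not described, so this is an assumption; in particular the add-one correction and the function name are choices made for this illustration. The folding energies themselves would come from an RNA folding program and are simply passed in as numbers here.

```python
from bisect import bisect_right
from typing import Sequence


def empirical_p_value(observed: float, null_energies: Sequence[float]) -> float:
    """Fraction of background energies at least as low as the observed energy.

    Lower (more negative) free energy means a more stable predicted structure.
    """
    ordered = sorted(null_energies)
    n_as_low = bisect_right(ordered, observed)
    # Add-one correction (an assumption) keeps the estimate above zero.
    return (n_as_low + 1) / (len(ordered) + 1)
```

The background list passed as `null_energies` must come from sequences of the same length as the sequence under test, since folding free energy depends on length.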

2.6. Evaluation of exonic splicing enhancer density

We obtained 238 hexamers from the RESCUE-ESE Web Server 34 as candidates of exonic
splicing enhancers. We counted the number of the hexamers in the SCCSs and in the rest
of the coding regions of the 10 790 genes (human sequences). We also measured the total
nucleotide length of the SCCSs and of the other regions and applied the chi-square test
to the ratio of hexamers.
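Counting hexamer hits in a sequence can be sketched as below. Whether the study counted overlapping occurrences is not stated, so counting every 6-nucleotide window is an assumption, and the function names are hypothetical. The hexamers in any example are placeholders and are not claimed to belong to the RESCUE-ESE set.

```python
from typing import Iterable


def count_hexamer_hits(seq: str, hexamers: Iterable[str]) -> int:
    """Count the 6-nucleotide windows of seq that match any candidate hexamer."""
    motif_set = {h.upper() for h in hexamers}
    seq = seq.upper()
    return sum(1 for i in range(len(seq) - 5) if seq[i:i + 6] in motif_set)


def hits_per_nucleotide(total_hits: int, total_length: int) -> float:
    """Density of hits, i.e. number of hits divided by region length in bp."""
    return total_hits / total_length
```

Summing the hit counts and region lengths separately for SCCS and non-SCCS regions yields the two densities that are compared by the chi-square test.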

2.7. Analysis on gene expression

We used EGenetics (http://www.nhmrc.gov.au/your_health/egenetics/index.htm) to
investigate gene expression of SCCS genes and non-SCCS genes. Human anatomical system
data, which indicate the organs in which a gene is expressed, were obtained from the
EGenetics database through Ensembl BioMart. We counted how many of the SCCS genes and
non-SCCS genes are expressed in each organ and divided the numbers by the total number
of SCCS genes and non-SCCS genes, respectively. Then we performed Fisher's exact test
to evaluate whether the difference between SCCS and non-SCCS genes is significant. All
P-values were corrected by FDR.
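A two-sided Fisher's exact test for a 2x2 table (for example, expressed versus not expressed, for SCCS versus non-SCCS genes) can be sketched with the standard library as follows. The two-sided rule used here, summing the probabilities of all tables that are no more likely than the observed one, is a common convention and is assumed rather than taken from the text; the function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test P-value for the table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(k: int) -> float:
        # Hypergeometric probability of observing k in the top-left cell.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    p_obs = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Small tolerance guards against floating-point ties.
    total = sum(p for p in (prob(k) for k in range(lowest, highest + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)
```

Exact integer binomials keep the sketch simple; for tables with thousands of genes per cell a statistics library would be considerably faster.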

3. Results 

3.1. Identification of SCCSs 

If an alignment has a high ratio of conservation, long invariant sequences may occur
easily, and vice versa. Therefore, the rareness of invariant sequences differs
depending on the conservation background of each alignment. To compensate for this, we
performed a permutation simulation. The idea of the permutation is to count the length
and frequency of invariant sequences after randomly changing the positions of the sites
and to use the result as the random expectation.

We used the frequency distribution of invariant sequences in the permuted alignments as
the random expectation, evaluated the probability of the invariant sequences in the
original alignments, and defined invariant sequences that are longer than 10 codons and
whose probability is below 0.01 (corrected by FDR) as SCCSs.

In total, 4150 SCCSs (192 306 bp) were obtained from 2273 alignments of the 10 790
orthologous gene sets (Supplementary file S2). This occupies 0.94% of the coding region
of the 10 790 genes. Table 1 shows the number of SCCSs per gene and the number of genes
that contain that number of SCCSs. Figure 1 is a graph of the lengths and numbers of
SCCSs (grey bars). Black dots indicate the random expectation obtained from the
permuted alignments.

In the permuted alignments, there are 141 sequences (total length 5550 bp) whose
probability is below 0.01, which means that the total size of the SCCS regions is
35-fold larger than this expectation (Supplementary file S3). The chi-square test
between SCCSs and the permutation result showed a significant difference
(P < 2.2E-16).

Table 1. Number of SCCSs in coding genes

Category          Number of SCCSs (per gene)    Number of genes
SCCS genes        1                             1366
                  2                             475
                  3                             219
                  4                             105
                  5                             42
                  >5                            66
Non-SCCS genes                                  8517

Figure 1. The length and number of SCCSs. X and Y-axes represent the length and
frequency of SCCSs, respectively. Grey bars show the length and the number of SCCSs.
Black dots indicate the number of sequences with probability below 0.01 in the permuted
alignments.

If invariant and variant sites were distributed randomly in the original alignment, the
frequency of invariant sequences in the permuted alignments would be similar to that in
the original alignment, because the starting point would already be random. The
difference before and after the permutation suggests that invariant sites in the
original alignments are clustered rather than randomly distributed.

3.2. GO, InterPro, and KEGG pathways enriched in SCCS containing genes

We used the Fatigo web service to investigate differences in GO terms, InterPro codes,
and KEGG pathways between the SCCS genes and non-SCCS genes. The difference was
evaluated by the two-tailed Fisher's exact test, and P-values were corrected by FDR.
Tables 2 and 3 show GO terms, InterPro codes, and KEGG pathways significantly enriched
(P < 0.01) in SCCS genes.

Table 2. GO terms significantly (P < 0.01) enriched with SCCS genes

Terms                                                       P*        SCCS (%) a  Non-SCCS (%) b  Fold c

Biological process
GO:0035136: Forelimb morphogenesis                          7.75E-05  0.53        0.04            13.25
GO:0035115: Embryonic forelimb morphogenesis                2.50E-04  0.48        0.04            12.00
GO:0060070: Canonical Wnt receptor signalling pathway       2.11E-03  0.4         0.04            10.00
GO:0035137: Hindlimb morphogenesis                          4.88E-04  0.53        0.06            8.83
GO:0001702: Gastrulation with mouth forming second          1.78E-03  0.44        0.05            8.80
GO:0009954: Proximal/distal pattern formation               3.80E-04  0.57        0.07            8.14
GO:0031128: Developmental induction                         1.03E-03  0.53        0.07            7.57
GO:0048593: Camera-type eye morphogenesis                   1.22E-05  0.88        0.13            6.77
GO:0021510: Spinal cord development                         2.37E-04  0.75        0.13            5.77
GO:0031016: Pancreas development                            1.85E-03  0.62        0.12            5.17

Cellular components
GO:0014704: Intercalated disc                               8.43E-03  0.35        0.04            8.75
GO:0043198: Dendritic shaft                                 2.38E-03  0.48        0.06            8.00
GO:0030425: Dendrite                                        2.38E-03  2.29        1.1             2.08
GO:0043025: Neuronal cell body                              2.38E-03  2.24        1.11            2.02
GO:0015629: Actin cytoskeleton                              1.15E-03  3.39        1.8             1.88
GO:0043005: Neuron projection                               1.15E-03  4.09        2.32            1.76

Molecular function
GO:0035254: Glutamate receptor binding                      3.41E-03  0.35        0.02            17.50
GO:0005072: Transforming growth factor beta receptor,       8.92E-03  0.31        0.02            15.50
  cytoplasmic mediator activity
GO:0004843: Ubiquitin-specific protease activity            3.71E-03  0.48        0.07            6.86
GO:0031625: Ubiquitin protein ligase binding                6.44E-03  0.62        0.14            4.43
GO:0003725: Double-stranded RNA binding                     6.44E-03  0.62        0.14            4.43
GO:0042054: Histone methyltransferase activity              8.64E-03  0.57        0.13            4.38
GO:0050825: Ice binding                                     6.50E-11  2.86        0.76            3.76
GO:0004221: Ubiquitin thiolesterase activity                4.18E-04  1.01        0.27            3.74
GO:0005199: Structural constituent of cell wall             1.34E-03  0.92        0.25            3.68
GO:0003682: Chromatin binding                               4.08E-08  2.15        0.59            3.64

a The percentages of SCCS genes that have the GO term.
b The percentages of non-SCCS genes that have the GO term.
c The fold of 'a' to 'b'. Terms are listed in descending order of the fold difference.
The ten terms with the highest fold differences are shown for Biological process and
Molecular function.
* Probability for enrichment of the GO term in the SCCS group.

Table 3. InterPro and KEGG terms significantly (P < 0.01) enriched with SCCS genes

Terms                                                       P*        SCCS (%) a  Non-SCCS (%) b  Fold c

InterPro
IPR003619: MAD homology 1, Dwarfin-type                     1.55E-04  0.44        0.01            44.00
IPR001827: Homeobox protein, antennapedia type,             1.77E-03  0.44        0.04            11.00
  conserved site
IPR000569: HECT                                             1.26E-05  0.75        0.07            10.71
IPR002077: Voltage-dependent calcium channel,               5.65E-04  0.53        0.05            10.60
  alpha-1 subunit
IPR010982: Lambda repressor-like, DNA-binding               1.49E-03  0.48        0.05            9.60
IPR002343: Paraneoplastic encephalomyelitis antigen         9.91E-05  0.7         0.08            8.75
IPR018359: Bromodomain, conserved site                      9.75E-04  0.57        0.07            8.14
IPR001487: Bromodomain                                      1.94E-06  0.97        0.12            8.08
IPR017995: Homeobox protein, antennapedia type              7.36E-04  0.62        0.08            7.75
IPR004088: K Homology, type 1                               7.36E-04  0.62        0.08            7.75

KEGG
hsa03018: RNA degradation                                   1.05E-03  0.92        0.24            3.83
hsa04340: Hedgehog signalling pathway                       6.75E-03  0.88        0.29            3.03
hsa04520: Adherens junction                                 4.55E-03  1.06        0.37            2.86
hsa05211: Renal cell carcinoma                              9.88E-03  0.92        0.33            2.79
hsa04120: Ubiquitin-mediated proteolysis                    1.05E-03  1.5         0.57            2.63
hsa04310: Wnt signalling pathway                            2.22E-04  2.07        0.79            2.62
hsa04360: Axon guidance                                     3.47E-04  1.8         0.69            2.61
hsa04810: Regulation of actin cytoskeleton                  1.07E-03  2.07        0.94            2.20
hsa04010: MAPK signaling pathway                            3.23E-04  3.08        1.5             2.05

a The percentages of SCCS genes that have the InterPro or KEGG term.
b The percentages of non-SCCS genes that have the InterPro or KEGG term.
c The fold of 'a' to 'b'. Terms are listed in descending order of the fold difference.
* Probability for enrichment of the InterPro or KEGG terms in the SCCS group.

All terms in the GO Biological process section are related to developmental processes.
One Cellular components term (GO:0014704) mediates mechanical and electrochemical
integration between cardiomyocytes, and the remaining five (GO:0043198, GO:0030425,
GO:0043025, GO:0015629, and GO:0043005) are associated with the neuron. In the
Molecular function category, three terms (GO:0004843, GO:0031625, and GO:0004221) are
related to the ubiquitin system and two (GO:0042054 and GO:0003682) are associated with
chromatin. Ubiquitins are known to be involved not only in protein degradation but also
in signal transduction, chromatin modification, and the cell cycle.

Of the ten InterPro codes listed in Table 3, IPR000569 is related to ubiquitin,
IPR002077 represents a calcium channel, and the other eight are all associated with
DNA- or RNA-binding functions that mediate transcriptional regulation or chromatin
modification.

Two KEGG pathways (hsa04340 and hsa04310) in Table 3 are developmental signalling
pathways, hsa04120 is related to the ubiquitin system, and hsa04360 is involved in axon
guidance, which corresponds well with the GO and InterPro terms.

Table 4 shows the terms that are significantly scarce in SCCS genes. In contrast to
Tables 2 and 3, the majority of the terms are involved in metabolic processes. Fatigo
can also explore enrichment of microRNA targets and transcription-factor-binding sites;
no significant item was found.



Table 4. GO, InterPro and KEGG terms significantly (P < 0.01) scarce in SCCS genes

Terms                                                       P*        SCCS (%) a  Non-SCCS (%) b  Fold c

Biological process
GO:0043039: tRNA aminoacylation                             6.15E-03  0.09        0.63            0.14
GO:0006725: Cellular aromatic compound metabolic process    7.31E-03  0.48        1.3             0.37
GO:0051186: Cofactor metabolic process                      6.05E-03  0.66        1.59            0.42
GO:0055114: Oxidation-reduction process                     5.86E-04  2.11        3.87            0.55
GO:0019752: Carboxylic acid metabolic process               3.54E-04  2.42        4.34            0.56
GO:0006082: Organic acid metabolic process                  3.54E-04  2.42        4.35            0.56
GO:0044255: Cellular lipid metabolic process                8.60E-03  3.3         4.98            0.66

Cellular component
GO:0019866: Organelle inner membrane                        2.38E-03  0.66        1.72            0.38
GO:0044429: Mitochondrial part                              1.27E-03  1.63        3.24            0.50
GO:0005615: Extracellular space                             2.87E-03  2.29        3.93            0.58

Molecular function
GO:0001595: Angiotensin receptor activity                   8.64E-03  0           0.4             0.00
GO:0016616: Oxidoreductase activity, acting on the CH-OH    6.42E-03  0.18        0.82            0.22
  group of donors, NAD or NADP as acceptor
GO:0016614: Oxidoreductase activity, acting on CH-OH group  8.92E-03  0.26        0.97            0.27
  of donors

InterPro
IPR002198: Short-chain dehydrogenase/reductase SDR          8.96E-03  0           0.46            0.00

KEGG
hsa04060: Cytokine-cytokine receptor interaction            2.86E-03  0.57        1.56            0.37

a The percentages of SCCS genes that have the GO, InterPro or KEGG term.
b The percentages of non-SCCS genes that have the GO, InterPro or KEGG term.
c The fold of 'a' to 'b'. Terms are listed in ascending order of the fold difference.
* Probability for enrichment of the GO, InterPro or KEGG terms in the SCCS group.

3.3. Overlap with UCEs

UCEs are defined as nucleotide sequences longer than 200 bp that are absolutely
conserved between orthologous regions of human, rat, and mouse. 18,19 UCEs are found in
both coding and non-coding regions. Previous studies report that genes with low
synonymous substitution or genes overlapping UCEs are associated with DNA binding, RNA
binding, transcription activity, and Homeobox. 18,19 In our 10 790 genes, 4009 bp in 29
genes overlap with UCEs. Of these 4009 bp, 2835 bp in 22 genes overlap with SCCSs
(Supplementary file S4). Because SCCSs are conserved in six mammalian species,
including the three species used to define UCEs, the 2835 bp reflect conservation in
the other three species (macaque, cow, and dog). We surveyed nucleotide sites in the
10 790 genes that are conserved among human, rat, and mouse, and evaluated how many of
them are also conserved in macaque, cow, and dog. The resulting ratio is 0.751;
therefore the expected conservation is 3035 bp. This matches well with the observation.
The regions overlapping UCEs make up only 1.47% of the entire SCCS regions. The other
SCCSs represent shorter but deeper conservation than UCEs.

3.4. SCCSs that form stable RNA secondary structures

In some cases, the secondary structure of an mRNA carries a function. 24,25 We examined
the secondary structures and free energy of the SCCSs using the Vienna RNA package. We
found three SCCSs whose folding energy was significantly low (Table 5).



Polg encodes a catalytic subunit of mitochondrial 
DNA polymerase POLG. The POLG protein is the only 
polymerase known to be involved in replication of 
mtDNA. Gal3st3 encodes a member of the galac- 
tose-3-O-sulfotransferase protein family. This protein 
exists on the membrane of Golgi apparatus. Smarcd3 
encodes a protein of SWI/SNF family, whose 
members display helicase and ATPase activities. This 
protein is thought to regulate transcription of target 
genes by altering the chromatin structure around 
those genes. 



Table 5. Genes containing SCCS with significantly (P < 0.001) low free folding energy

Gene      Description                                             Length (nt)  Free energy
polg      DNA polymerase subunit gamma-1                          36           -19.9
gal3st3   Galactose-3-O-sulfotransferase 3                        36           -22.6
smarcd3   SWI/SNF-related matrix-associated actin-dependent       39           -23.9
          regulator of chromatin subfamily D member

The gene names are represented by those of human.

Figure 2. The probability density of the folding free energy. This graph shows the
probability density of the folding free energy for 36 and 39 nucleotide-long sequences.
The three SCCSs with significantly low free energy are indicated on the graph.
Probability density was created using software package R.

Figure 2 shows the probability density of the folding free energy constructed from
randomly extracted sequences. Each line shows the free energy of sequences of the same
length as one of the above three SCCSs. Gene names on the lines mark the free energy of
the SCCSs. This figure suggests that these three SCCSs have extremely low free energy
and that these regions will form stable secondary structures. These SCCSs may be
conserved because of the requirement for the secondary structures. However,
substitution restriction of this type might be modest, because a secondary structure
can be retained by another combination of nucleotides as long as the complementarity is
maintained.



3.5. The density of exonic splicing enhancers in SCCSs and in the other coding regions

One of the well-known functional nucleotide elements in the coding region is the
splicing signal. We obtained 238 hexamers from the RESCUE-ESE Web server as candidates
of exonic splicing enhancers, and counted the number of hexamers in SCCSs and non-SCCS
regions of the 10 790 genes (Table 6). Splicing signals in non-SCCS regions are counted
on human sequences. We observed 20 420 hits of signals in 192 306 bp of SCCS regions
and 2 183 544 hits in 20 566 520 bp of non-SCCS regions. The number of signals per
nucleotide is 0.106 for both SCCS and non-SCCS regions, and there was no significant
difference (P = 0.99, chi-square test).
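The comparison of the two densities can be reproduced approximately from the reported counts, using the non-SCCS region size given in Table 6 (20 566 520 bp). The sketch below uses a Pearson chi-square test on a 2x2 table of hit versus non-hit positions without continuity correction; this table layout is an assumption, since the exact form of the test is not described, so the resulting P-value need not match the reported value exactly. The function name is hypothetical.

```python
from math import erfc, sqrt
from typing import Tuple


def chi_square_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Pearson chi-square statistic and P-value (1 df) for [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = erfc(sqrt(chi2 / 2.0))
    return chi2, p_value


# Counts reported in the text and in Table 6.
sccs_hits, sccs_len = 20_420, 192_306
other_hits, other_len = 2_183_544, 20_566_520

chi2, p = chi_square_2x2(sccs_hits, sccs_len - sccs_hits,
                         other_hits, other_len - other_hits)
sccs_density = sccs_hits / sccs_len
other_density = other_hits / other_len
```

Both densities round to 0.106, and the test gives no indication of a difference between SCCS and non-SCCS regions, in line with the conclusion above.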
Table 6. Splicing signals in SCCS and non-SCCS regions of 10 790 genes

Region      Region size (bp)  #Splicing signals  Per nucleotide
SCCS        192 306           20 420             0.106
Non-SCCS    20 566 520        2 183 544          0.106

Splicing signals in non-SCCS regions are counted on human sequences.



3.6. Gene expression 

We investigated the difference in gene expression between SCCS genes and non-SCCS genes
by referring to the anatomical system data of EGenetics, which give qualitative
information about the organs in which a gene is expressed. We counted the number of
SCCS genes and non-SCCS genes expressed in the organs and performed Fisher's exact test
as described in the Materials and methods section.

We compared the percentages of genes that are expressed in each organ. Table 7 shows
organs in which a significantly higher percentage of SCCS genes are expressed compared
with non-SCCS genes. In general, SCCS genes are expressed in a wider variety of organs.
This observation agrees with a previous study. 35 Only in medulla oblongata and
trophoblast did non-SCCS genes show a significantly higher percentage than SCCS genes
(data not shown).

Seven organs (amygdala, spinal cord, cerebellum cortex, cerebellum, frontal lobe,
pituitary gland, and sympathetic chain) in Table 7 are related to the nervous system,
and six organs (cochlea, trabecular meshwork, hypopharynx, larynx, tongue, and thyroid)
are associated with the head and neck.

3.7. Preferred codons, GC content, and codon 
degeneracy of SCCSs 

Codon usage biases towards optimum codons are known to suppress synonymous
substitution. Optimum codons reflect the composition of the genomic tRNA pool and are
advantageous for translation efficiency or accuracy. As an approximate index of optimum
codons, we used preferred codons, i.e. the most frequently used codon for each amino
acid, and evaluated the preferred codon fraction in SCCSs. In SCCS regions, 28 188 of
64 102 codons are preferred codons, and in non-SCCS regions, 2 961 482 of 6 855 505
codons are preferred codons. The ratios of preferred codons in SCCS and non-SCCS
regions are 0.440 and 0.432, respectively. The difference is significant at the 0.05
significance level (P = 0.013), but the difference between the ratios is merely 0.008.
We also observed that the ratio of preferred codons decreases as the length of the SCCS
increases (Fig. 3). Judging from these results, SCCSs are unlikely to be retained
solely by codon preference.
Table 7. Organs in which a significantly (P < 0.001) higher percentage of SCCS genes are expressed compared with non-SCCS genes

                                 SCCS genes                Non-SCCS genes
Organ                 P*         #Expressed  Percentage a  #Expressed  Percentage b  Fold c
Amygdala              1.32E-11   201         9.86          318         5.37          1.84
Cochlea               1.24E-22   436         21.39         723         12.21         1.75
Small intestine       5.28E-03   49          2.40          86          1.45          1.66
Amnion                1.03E-04   102         5.00          183         3.09          1.62
Amniotic fluid        9.12E-07   175         8.59          321         5.42          1.58
Spinal cord           7.21E-05   129         6.33          243         4.10          1.54
Artery                9.80E-05   133         6.53          254         4.29          1.52
Cerebellum cortex     2.74E-03   83          4.07          160         2.70          1.51
Cerebellum            3.62E-08   277         13.59         543         9.17          1.48
Trabecular meshwork   1.38E-05   199         9.76          399         6.74          1.45
Frontal lobe          1.63E-28   899         44.11         1803        30.45         1.45
Hypopharynx           1.75E-06   246         12.07         497         8.39          1.44
Pituitary gland       1.67E-09   421         20.66         877         14.81         1.39
Sympathetic chain     9.65E-06   271         13.30         575         9.71          1.37
Breast                5.95E-30   1133        55.59         2430        41.04         1.35
Larynx                9.05E-13   642         31.50         1385        23.39         1.35
Tongue                4.55E-07   377         18.50         816         13.78         1.34
Smooth muscle         1.36E-05   307         15.06         670         11.32         1.33
Thyroid               3.36E-23   1076        52.80         2375        40.11         1.32
Adrenal gland         4.51E-06   362         17.76         801         13.53         1.31

a The percentages of SCCS genes that are expressed in the organ.
b The percentages of non-SCCS genes that are expressed in the organ.
c The fold of 'a' to 'b'. Terms are listed in descending order of the fold difference.
* Probability for enrichment of the expressed genes in the SCCS group.



In mammals, GC content is known to influence nucleotide change as a result of CpG
hypermutability.
We investigated GC content of SCCSs. GC content in 
the first (GC1), second (GC2), and third position 
(GC3) of codons show different patterns along the 
sequence length (Fig. 4). GC1 is mostly constant but 
GC2 increases while GC3 decreases as the length of 
SCCS increases. Because mammalian genomes prefer 
GC-ending codons, the decrease of GC3 corresponds 
to the decrease of preferred codons. The decrease of 
GC3 seems to be complementary with the increase 
of GC2 because GC content as a whole is constant 
(Supplementary Fig. S1). 
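The position-specific GC content discussed above can be computed as in the following sketch. It assumes an in-frame, gap-free coding sequence; the function name is hypothetical.

```python
from typing import Tuple


def gc_by_codon_position(seq: str) -> Tuple[float, float, float]:
    """Return (GC1, GC2, GC3) as fractions for an in-frame coding sequence."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3  # ignore an incomplete trailing codon
    n_codons = usable // 3
    fractions = []
    for pos in range(3):
        bases = seq[pos:usable:3]  # every third base, starting at this position
        fractions.append(sum(base in "GC" for base in bases) / n_codons)
    return fractions[0], fractions[1], fractions[2]
```

For the sequence ATG GCC AAC, the first and second positions each contain one G or C out of three, while all three third positions are G or C.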

Conservation of SCCSs may occur by chance in a 
region where amino acid constraint is strong and 
codon degeneracy is low. We investigated codon degen- 
eracy of SCCSs to examine this possibility. The average 
degeneracy is between three and four and increases 
as the sequence length increases (Fig. 5). This result 
suggests that even if the first and second positions of 
codons are restricted, the third position has enough 
freedom to change. Therefore, it is unlikely that SCCSs 
are conserved because of the amino acid constraint 
combined with the low degeneracy of codons. 
Figure 3. The ratio of preferred codons in SCCSs. X-axis represents the length of
SCCSs, and Y-axis represents the ratio of preferred codons of the sequences. Classes
whose sample size < 20 were combined. Error bars represent 1 SE.

Figure 4. GC content of the first (GC1), the second (GC2), and the third (GC3) position
of codons in SCCSs: (A) GC1, (B) GC2, (C) GC3. X-axis represents the length of SCCSs,
and Y-axis represents GC content of the sequences. Classes whose sample size < 20 were
combined. Error bars represent 1 SE.

Figure 5. Codon degeneracy of SCCSs. X-axis represents the length of SCCSs and Y-axis
represents the averaged codon degeneracy of the sequences. Classes whose sample size
< 20 were combined. Error bars represent 1 SE.

4. Discussion

GO terms, InterPro codes, and KEGG pathways enriched in SCCS genes show a strong
association with developmental processes, transcriptional regulation, and neurons.
Genes associated with transcriptional regulation or neurons are known to have a low
synonymous substitution ratio. This phenomenon has been discussed in relation to codon
biases that improve translational efficiency or accuracy. However, our analysis of the
preferred codons in SCCSs suggests that codon preference is unlikely to be the major
factor influencing the conservation of SCCSs.

Analyses of the ratio of preferred codons, GC content, and codon degeneracy clarify the
characteristics of SCCSs. The ratio of preferred codons decreases as the SCCS length
increases. Drummond and Wilke 36 investigated the correlation between the synonymous
substitution rate (dS) and the fraction of optimum codons (Fop), and detected negative
correlations between dS and Fop in rodents (mouse and rat) and a positive correlation
in the human-dog comparison. If the SCCSs followed the same trend as the rodents in the
previous study, the ratio of preferred codons in SCCSs should be high; however, this is
not what we observed. Neither GC content nor codon degeneracy revealed a factor that
would lower nucleotide substitution.

One methodological difference is that our research focused on local and complete
conservation of nucleotides instead of the dS over the entire region of a gene, and
that we investigated conservation among six mammalian species instead of a pair-wise
comparison. The difference in results may suggest that the factors underlying local and
strong conservation such as SCCSs differ from the factors acting on gene-wide
conservation.

Makalowski et al. 37 showed a correlation between the synonymous substitution rate (dS)
and the non-synonymous substitution rate (dN). Such correlations may occur when the
constraint on a certain nucleotide sequence is so strong that dN is also lowered.
The usage of relatively rare codons and the strong local conservation of SCCSs may be
preferable for regulatory signals. The fraction of SCCSs in the coding region of the
10 790 genes is 0.94%. This fraction is so small that it would not influence
conventional evolutionary analysis. Although the fraction is small, or because the
fraction is small, SCCSs may have potential as regulatory elements.